{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d180958e-8362-4453-ad21-78ec618bc624",
   "metadata": {},
   "source": [
    "# Ordinal Encoding in Scikit-Learn"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11bdfc69-a04f-462e-8c14-c0b66dfd1796",
   "metadata": {},
   "source": [
    "## Defining some toy data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ccc1b22-8bbb-44e3-bb43-320d09d69cc0",
   "metadata": {},
   "source": [
    "- We start by defining some toy data here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "61f31b73-d486-4bc5-876a-86636c1acb86",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "02e244fd-76b0-430f-a002-4291d7d687e3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numerical</th>\n",
       "      <th>categorical</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.1</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.1</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.1</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.2</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1.2</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2.1</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>3.1</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>4.1</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    numerical categorical\n",
       "0         1.1           b\n",
       "1         2.1           b\n",
       "2         3.1           b\n",
       "3         4.2           b\n",
       "4         5.1           a\n",
       "5         6.1           a\n",
       "6         7.1           a\n",
       "7         8.1           a\n",
       "8         1.2           c\n",
       "9         2.1           c\n",
       "10        3.1           c\n",
       "11        4.1           c"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_1 = [\n",
    "    1.1, 2.1, 3.1, 4.2,\n",
    "    5.1, 6.1, 7.1, 8.1,\n",
    "    1.2, 2.1, 3.1, 4.1\n",
    "]\n",
    "\n",
    "feature_2 = [\n",
    "    'b', 'b', 'b', 'b',\n",
    "    'a', 'a', 'a', 'a',\n",
    "    'c', 'c', 'c', 'c'\n",
    "]\n",
    "\n",
    "df = pd.DataFrame({'numerical': feature_1, 'categorical': feature_2})\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6553be97-1aa2-4f59-a215-8328d43d812f",
   "metadata": {},
   "source": [
    "## Ordinal Encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfc7600c-373f-4db1-b8b6-6a6397554cba",
   "metadata": {},
   "source": [
    "- Usually, we use onehot encoding if we have categorical data without ordering information, so-called nominal data.\n",
    "- An example of such data is blood type (A, B, AB, or O)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f869142-1772-4f2e-b533-6fa95b68b44b",
   "metadata": {},
   "source": [
    "- Ordinal encoding is typically used if we have categorical data with ordering information.\n",
    "- One example of such data is T-shirt sizes (XS, S, M, L, or XL)\n",
    "- Now, assume that the \"categorical\" column above has ordered features; we can use the `OrdinalEncoder` to encode that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8df4f93d-ea27-45eb-bbc2-b9909a825217",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([['b'],\n",
       "       ['b'],\n",
       "       ['b'],\n",
       "       ['b'],\n",
       "       ['a'],\n",
       "       ['a'],\n",
       "       ['a'],\n",
       "       ['a'],\n",
       "       ['c'],\n",
       "       ['c'],\n",
       "       ['c'],\n",
       "       ['c']], dtype=object)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = df['categorical'].values.reshape(-1, 1)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d8516c98-fa86-4d8f-856e-b99eba083134",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.],\n",
       "       [1.],\n",
       "       [1.],\n",
       "       [1.],\n",
       "       [0.],\n",
       "       [0.],\n",
       "       [0.],\n",
       "       [0.],\n",
       "       [2.],\n",
       "       [2.],\n",
       "       [2.],\n",
       "       [2.]])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "\n",
    "ode = OrdinalEncoder(\n",
    "    categories= [['a', 'b', 'c']]\n",
    ")\n",
    "\n",
    "ode.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af60691c-18eb-4059-894d-868834631887",
   "metadata": {},
   "source": [
    "- Notice that based on the alphabetical ordering the ordinal encoder assumes that `'a: 0 < b: 1 < c: 2'`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5234dc5f-1102-4ec6-8e37-8a51a511f745",
   "metadata": {},
   "source": [
    "- If we want to change that and have an ordering assumption like `'b: 0 < a: 1 < c: 2'`, we can override the feature ordering via the `categories` attribute as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "af38a3ba-0d0d-42ee-aad3-0c337ca74b4f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.],\n",
       "       [0.],\n",
       "       [0.],\n",
       "       [0.],\n",
       "       [1.],\n",
       "       [1.],\n",
       "       [1.],\n",
       "       [1.],\n",
       "       [2.],\n",
       "       [2.],\n",
       "       [2.],\n",
       "       [2.]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ode = OrdinalEncoder(\n",
    "    categories= [['b', 'a', 'c']]\n",
    ")\n",
    "\n",
    "ode.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "542c184b-8102-467f-8bb0-dea1089be84f",
   "metadata": {},
   "source": [
    "## Using the `OrdinalEncoder` when other columns are present"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c02cb0ca-8261-4d11-baa3-8c7a8e5f37a8",
   "metadata": {},
   "source": [
    "- Below is an example using a `ColumnTransformer` to transform only specific columns via the `OrdinalEncoder` when multiple columns are present.\n",
    "- For instance, considering the toy dataset at the top, assume we only want to transform the \"categorical\" column but not the \"numerical\" column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b675a433-bbb5-4bf4-a011-746f94cd8e4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn\n",
    "from sklearn.compose import ColumnTransformer\n",
    "\n",
    "\n",
    "ohe = OrdinalEncoder()\n",
    "\n",
    "X = df.values\n",
    "categorical_features = [1]\n",
    "\n",
    "col_transformer = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('cat', ohe, categorical_features)],\n",
    "         remainder='passthrough'\n",
    ")\n",
    "\n",
    "col_transformer.fit(df)\n",
    "X_t = col_transformer.transform(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68a15566-b874-403f-8547-82922808980a",
   "metadata": {},
   "source": [
    "- Note that there are a few extra workaround like the `FloatTransformer()`, which are explained [here](sklearn-onehot-encoding-mixedtype-df.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9db9813e-e780-4316-a7b6-833ecb719634",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1. , 1.1],\n",
       "       [1. , 2.1],\n",
       "       [1. , 3.1],\n",
       "       [1. , 4.2],\n",
       "       [0. , 5.1],\n",
       "       [0. , 6.1],\n",
       "       [0. , 7.1],\n",
       "       [0. , 8.1],\n",
       "       [2. , 1.2],\n",
       "       [2. , 2.1],\n",
       "       [2. , 3.1],\n",
       "       [2. , 4.1]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4e350f04-0c4f-409f-96c5-b5aa915f20a3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sklearn: 1.0.2\n",
      "pandas : 1.4.0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark --iversions"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 64-bit ('base': conda)",
   "language": "python",
   "name": "python392jvsc74a57bd0249cfc85c6a0073df6bca89c83e3180d730f84f7e1f446fbe710b75104ecfa4f"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
